We chose the Red Wine Quality dataset (see Cortez et al. [1]).

Univariate Plots

str(rwine)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The X variable only serves as a row index. It carries no information for this analysis and was removed.
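Dropping an index column can be sketched as follows. The small data frame is only a stand-in built from the first rows of the str() output above; the names rwine_raw and rwine_demo are hypothetical and not part of the original analysis.

```r
# Stand-in for the raw data: an index column X plus two of the real columns.
# 'rwine_raw' is a hypothetical name; the values come from the str() output.
rwine_raw <- data.frame(
  X       = 1:3,
  alcohol = c(9.4, 9.8, 9.8),
  quality = c(5L, 5L, 5L)
)

# Keep every column except the index.
rwine_demo <- rwine_raw[, setdiff(names(rwine_raw), "X"), drop = FALSE]
names(rwine_demo)
```

Selecting by name with setdiff() keeps working if the column order changes, which a positional rwine_raw[, -1] would not.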

Quality is an integer ranging from 0 (worst quality) to 10 (best quality).

rwine$quality.as.factor <- factor(rwine$quality)
levels(rwine$quality.as.factor) <- c("bad",
                                     "below av.",
                                     "average",
                                     "above av.",
                                     "good",
                                     "very good")

Quality is stored as an integer rather than as a categorical variable, but it could be turned into one with levels such as terrible (0-1), really bad (2), bad (3), below average (4), average (5), above average (6), good (7), very good (8) and excellent (9-10). We kept the numeric variable and created a new categorical variable named quality.as.factor using this ranking. Since only the ratings 3 to 8 occur in the data, the factor has six levels, from bad to very good. The choice of adjectives used to describe the ratings from 0 to 10 is evidently a matter of taste. The scale was not presented that way to the experts, so such a transformation might introduce further uncertainty in the rating. Nevertheless, those are the labels that appear in the plots for better readability.
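The relabelling above relies on the observed ratings being exactly 3 to 8, in that order. A slightly more defensive sketch spells out the rating-to-label mapping, so that a missing or additional rating cannot silently shift the labels. The vector quality_demo is made-up illustration data, and marking the factor as ordered is an assumption of this sketch.

```r
# Made-up ratings, for illustration only.
quality_demo <- c(5L, 5L, 6L, 7L, 3L, 8L, 4L)

# Labels in the same order as the ratings 3, 4, ..., 8.
quality_labels <- c("bad", "below av.", "average",
                    "above av.", "good", "very good")

# 'levels' fixes which rating receives which label, whatever the data contain.
quality_demo_factor <- factor(quality_demo,
                              levels  = 3:8,
                              labels  = quality_labels,
                              ordered = TRUE)
table(quality_demo_factor)
```

An ordered factor also allows comparisons such as quality_demo_factor >= "good".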

table(rwine$quality)
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The bulk of the wines is of average quality: there are many more normal wines (rated 4, 5 or 6) than excellent or poor ones.
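The imbalance can be quantified directly from the counts in the table above. The sketch below only re-enters those six counts by hand; the names counts and share are hypothetical.

```r
# Counts per rating, copied from the table(rwine$quality) output.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Percentage of wines per rating.
share <- round(100 * counts / sum(counts), 1)
share

# Combined share of the two most frequent ratings (5 and 6).
round(100 * sum(counts[c("5", "6")]) / sum(counts), 1)
```

Ratings 5 and 6 alone account for more than four fifths of the 1599 wines, which is why the extreme categories are so hard to see in the plots.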

ggplot(data=rwine,aes(x=fixed.acidity)) +
  geom_histogram(color=I('black'), fill=I('#099DD9'), binwidth=0.2)

Could it be that the few wines with excess acidity are also those with a lot of residual sugar to counteract the acidity? If not, are those wines of excellent or poor quality?

ggplot(data=rwine,aes(x=volatile.acidity)) +
  geom_histogram(color=I('black'), fill=I('#099DD9'), binwidth=0.03)

Volatile acidity may contribute significantly to the smell of red wines, and wine experts do not underestimate the importance of smell when it comes to taste. Do the few wines with a lot of volatile acidity smell unpleasant, and were they downrated as a result?

ggplot(data=rwine,aes(x=residual.sugar)) +
  geom_histogram(color=I('black'), fill=I('#099DD9'), binwidth=0.2)

The same remarks as above apply to the outliers (the wines with excess residual sugar). The distribution is bell-shaped with a heavy right tail.

ggplot(data=rwine,aes(x=free.sulfur.dioxide)) +
  geom_histogram(color=I('black'), fill=I('#099DD9'), binwidth=1)

summary(rwine$free.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

ggplot(data=rwine,aes(x=total.sulfur.dioxide)) +
  geom_histogram(color=I('black'), fill=I('#099DD9'), binwidth=4)

summary(rwine$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

For the vast majority of wines, there seems to be a lower limit to sulfur dioxide. Could this lower limit correspond to the sulfur dioxide that naturally arises from the fermentation process and is thus unavoidable? Winemakers also use it as an additive for its antioxidant and preservative properties. We suspected that the very few outliers with excess sulfur dioxide are wines of poor quality (a plot in the next section reveals that they were in fact rated as good).

ggplot(data=rwine,aes(x=log10(total.sulfur.dioxide))) +
  geom_histogram(color=I('black'), fill=I('#099DD9'), binwidth=0.04)

The log10 transformation compresses the long right tail: the distribution becomes flatter and slightly bell-shaped.
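The effect of the transformation on a right-skewed variable can be illustrated with simulated data. This is only a sketch: the sample is drawn from a log-normal distribution rather than taken from the wine data, and skewness() is a simple moment estimator defined here, not a function from a package.

```r
set.seed(1)

# Simulated right-skewed sample (NOT the wine data).
x <- rlnorm(1000, meanlog = 3.6, sdlog = 0.7)

# Simple moment-based skewness: 0 for a symmetric distribution,
# positive for a heavy right tail.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

c(raw = skewness(x), log10 = skewness(log10(x)))
```

The skewness of the log-transformed sample is much closer to zero, which is the numerical counterpart of the more symmetric histogram.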

ggplot(data=rwine,aes(x=sulphates)) +
  geom_histogram(color=I('black'), fill=I('#099DD9'), binwidth=0.03)

Sulphates seems to be used as a synonym for sulfur. If so, it ought to correlate with the sulfur dioxide variables. The shape of the distribution is indeed somewhat reminiscent of that of the total sulfur dioxide. The x-axis range is different, so it might be a (linearly?) scaled version.

ggplot(data=rwine,aes(x=alcohol)) +
  geom_histogram(color=I('black'), fill=I('#099DD9'), binwidth=0.1)

summary(rwine$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

On average, the selected red wines have an alcohol level of 10.4%. The median is 10.2%.

Univariate Analysis

What is the structure of your dataset?

This tidy data set contains 1599 red wines with 11 variables (or features) on the chemical properties of the wine: fixed and volatile acidities, citric acid, residual sugar, chlorides, free and total sulfur dioxide, density, pH, sulphates and alcohol.

Quality is the output. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (terrible) and 10 (excellent).

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the quality of the wine. We would like to determine which features best explain the quality of a wine (feature selection). We could consider it as a regression analysis since the output is numeric, or, since the output can easily be made categorical, as a classification problem.

It would also be interesting to see whether excellent wines share common characteristics and, if so, to reveal them. This objective is difficult because we do not expect a good wine to be explained by very few variables, even less so when some of those variables, like the amount of sulfur dioxide, can be more or less controlled. Overplotting is a further obstacle: when transparency is used to circumvent it, as is usually done, bad and very good wines are barely visible because there are too few of them.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

It is difficult to pick a priori a few variables that would surely explain most of the variation in wine quality. Tannins and phenolics are commonly mentioned in wine circles when discussing the flavour and quality of wine. They are also responsible for the health benefits attributed to red wine (cancer prevention, …). However, they are not among the variables. A closer look at the variables reveals that not all features are chemical properties of the wine: most are indeed chemical properties (e.g. pH), but some are chemicals (e.g. alcohol).

Domain knowledge is invaluable in understanding and analyzing a dataset. So we dug a bit deeper to better understand what makes up a wine.

From The Chemistry of Wine, we learned that red wine is made up of a huge number of different compounds. On average, a red wine contains 86% water and 12% ethyl alcohol. Glycerol makes up around 1%, with a variety of acids making up an additional 0.4%. Flavonoids (e.g. tannins and phenolics) comprise just 0.1% of the red wine, but contribute to its color and flavour.

In detail: The anthocyanins contribute the majority of red wine’s coloration. This coloration depends on the surrounding acidity. The flavan-3-ols contribute to the bitterness of wine. High alcohol concentrations have been shown to enhance this bitterness. The flavonols have no sensory impact attributed to them. The tannins contribute to the red wine’s astringency, or dryness, as well as its bitterness. They also contribute to the color by combining with the anthocyanins.

None of those compounds are listed as variables. This is curious. We can only hope that they can be uniquely determined by the variables that were retained.

Another interesting variable is sulfur, or more precisely sulfur dioxide. Why was it listed? Because almost every bottle of wine now carries the label “Contains Sulfites”? From Sulfur in Wine, Demystified, we learned that sulfur is a natural byproduct of the fermentation process. Furthermore, winemakers use sulfur as a preservative to protect juice and wine from oxidation and the influence of bacteria. As a result, a wine’s pH and alcohol levels contribute to how much sulfur is added prior to bottling. The lower the pH and the higher the alcohol level, the less sulfur a wine might need. Indeed, the more alcohol a wine has, the more protected it is from oxygen’s effects. Similarly, the lower the pH, the safer the wine is from microbial decomposition. This addition of sulfur is harmless and, except for people suffering from asthma, it cannot be sensed. Consequently, it should not play a role in predicting a wine’s quality. However, at high doses (when its use is not managed well), its perception in wine is reminiscent of matchsticks, burnt rubber, or mothballs. Such wines are often termed sulfitic (see Wine fault).

We gleaned from Acids in wine that acids aid in enhancing the effectiveness of sulfur dioxide to protect the wines from spoilage and can also protect the wine from bacteria due to the inability of most bacteria to survive in low pH solutions. In red wines, acidity helps preserve and stabilize the color of the wine. Additionally, wines with lower pH have redder, more stable colors. Wines with higher pH have higher levels of less stable blue pigments, eventually taking on a muddy grey hue. These wines can also develop a brownish tinge. So we expect pH and acidity (through the variables fixed.acidity and volatile.acidity) to correlate and contribute to the quality of a wine’s color.

Citric acid is only found in very minute quantities in wine grapes. In the EU, use of citric acid for acidification is prohibited. Hence the role of this variable might be negligible.

The role of volatile acidity is best explained in Acids in wine: Most of the acids involved with wine are fixed acids with the notable exception of acetic acid, mostly found in vinegar, which is volatile and can contribute to the wine fault known as volatile acidity. Wines starting out with a high pH level (above 3.5) stand the greatest risk of excessive acetic acid production.

Chlorides are components that make up table salt.

Finally, in Wine Jargon, we are told that the residual sugar is the leftover after fermentation ceases. Residual sugar has a balancing relationship with acidity, so if a wine has sugar, we would probably want a strong acidity too. This balancing effect is very likely to contribute to a wine’s quality.

We could not find anything about sulphates. Most (all?) of those who use this term treat it implicitly as a synonym for sulfur dioxide.

What we have found is fairly vague. So a scatterplot matrix would be welcome to rapidly spot any feature of interest.

Did you create any new variable from existing variables in the dataset?

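Only quality.as.factor, a labelled categorical version of quality used for the plots. No variable was derived from the chemical features.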

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

No, this is already a tidy dataset.

Bivariate Plots

pairs.panels(x=rwine, ellipses=F, lm=T)

Matrix plots are extremely useful because they let us rapidly detect which variables correlate strongly with our feature of interest. Alcohol and volatile acidity stand out in this respect, but in general the correlations are only moderate to weak. In the following, we consider almost every variable in turn for its influence on quality, because the correlation coefficient can be misleading if the relationship is not linear.
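Why a correlation coefficient can mislead for nonlinear relationships is easy to show on toy data. The sketch below uses made-up vectors, not the wine data: a perfect quadratic dependence yields a Pearson coefficient of essentially zero, and a perfectly monotone but curved dependence is only fully captured by the rank-based Spearman coefficient.

```r
# Toy data, symmetric around zero.
x <- seq(-3, 3, by = 0.5)

# y is fully determined by x, yet the linear correlation vanishes.
y_quadratic <- x^2
cor(x, y_quadratic)

# Monotone but nonlinear: Pearson stays below 1, Spearman reaches 1.
y_exp <- exp(x)
cor(x, y_exp)
cor(x, y_exp, method = "spearman")
```

This is the reason each pair of variables is also inspected in a scatter plot instead of relying on the coefficient alone.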

ggplot(data=rwine,aes(x=alcohol,y=quality)) +
  geom_point(position = position_jitter(w=0.03,h=0.25)) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

Though alcohol is the strongest predictor of quality based on the correlation coefficient (+0.48), a linear model does a poor job: it would only ever predict red wines of average, above average or good quality. The variance is huge, and in the next section we will search for a third variable that could explain it.

ggplot(data=rwine,aes(x=quality.as.factor, y=alcohol)) +
  geom_boxplot()

For wines whose quality is normal or above, the median of the alcohol level increases steadily and steeply as quality improves.

ggplot(data=rwine,aes(x=alcohol)) +
  geom_histogram(color=I('black'), fill=I('#099DD9')) +
  facet_wrap(~quality.as.factor)

The histograms conditioned on quality do not reveal anything unusual (e.g. a structure like multi-modality that we could exploit).

ggplot(data=rwine,aes(x=sulphates,y=quality)) +
  geom_point(position = position_jitter(w=0.01,h=0.25)) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

# most variables have outliers that strongly distort the correlation coeff.;
# to some extent, we can overcome this issue by getting rid of samples 
# at the bottom and top percentiles;
# this function accepts a data frame 'mydf', the name 'v' of one of its
# variables, a lower sample quantile 'p_low' and an upper sample quantile
# 'p_high'; within each quality level, it removes the samples at or below
# the 'p_low' quantile and at or above the 'p_high' quantile of 'v',
# and returns the trimmed data frame (with helper columns q_low and q_high).

# notice that since 'v' is a string, we access the column named 'v'
# of the data frame through mydf[[v]] in a function;
# mydf$v would not work since $ looks for a column literally named "v";
# mydf[,v] or mydf[,"sulphates"] would also be fine for a data frame;

trim_df <- function(mydf, v, p_low, p_high) {
  
  # stop() aborts on invalid input instead of merely printing a message
  if ( (p_low >= p_high) || (p_low<0) || (p_high>1) ) {
    stop("invalid probs values!")
  }
  if ( ! is.character(v) ) {
    stop("expecting a variable's name")
  }
  if ( ! (v %in% colnames(mydf)) ) {
    stop("unknown variable!")
  }
  
  mydf$q_low <- NA
  mydf$q_high <- NA

  # store the per-quality quantiles of 'v' next to each sample
  for (q in levels( factor(mydf$quality) ) ) {
    
    wines <- subset(mydf, quality==q)
    qt <- quantile(x=wines[[v]], probs = p_high, names=F)
    mydf[mydf$quality==q,"q_high"] <- qt
    qt <- quantile(x=wines[[v]], probs = p_low,  names=F)
    mydf[mydf$quality==q,"q_low"] <- qt
  }
  # keep only the samples strictly between both quantiles
  wo_outliers_df <- subset(mydf, mydf[[v]]>q_low & mydf[[v]]<q_high)
  return(wo_outliers_df)
} 

wo_outliers_rwine <- trim_df(rwine,"sulphates",0.04, 0.96)
ggplot(data=wo_outliers_rwine,aes(x=sulphates,y=quality)) +
  geom_point(position = position_jitter(w=0.01,h=0.25)) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

cor(x=wo_outliers_rwine$sulphates, wo_outliers_rwine$quality)
## [1] 0.4186638

As shown in the top plot, sulphates correlates moderately with quality (+0.25). After trimming the bottom and top percentiles of the distribution of sulphates conditioned on quality, as done in the bottom plot, the correlation coefficient rises to +0.42.
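The same per-quality trimming can be written without an explicit loop. This is an alternative sketch, not the function used for the plots: trim_by_group and the toy data frame are hypothetical names introduced here. It relies on ave() from base R, which computes a statistic per group and repeats it for every row of that group.

```r
# Remove, within each level of 'group', the rows whose value of 'v' is not
# strictly between the p_low and p_high sample quantiles of that level.
trim_by_group <- function(df, v, group, p_low, p_high) {
  lo <- ave(df[[v]], df[[group]],
            FUN = function(x) quantile(x, probs = p_low, names = FALSE))
  hi <- ave(df[[v]], df[[group]],
            FUN = function(x) quantile(x, probs = p_high, names = FALSE))
  df[df[[v]] > lo & df[[v]] < hi, , drop = FALSE]
}

# Toy data (made up): two quality levels with 11 wines each.
toy <- data.frame(
  quality   = rep(c(5L, 6L), each = 11),
  sulphates = c(40:50, 60:70) / 100
)

trimmed <- trim_by_group(toy, "sulphates", "quality", 0.04, 0.96)
c(before = nrow(toy), after = nrow(trimmed))
```

Because the comparison is strict, the smallest and largest value of every quality level are always dropped, however small the level is. This matters for the rare categories, which have few wines to begin with.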

ggplot(data=rwine,aes(x=quality.as.factor, y=sulphates)) +
  geom_boxplot()

ggplot(data=wo_outliers_rwine,aes(x=quality.as.factor, y=sulphates)) +
  geom_boxplot()

The median of the distribution for sulphates increases as quality gets better (in the top plot, the full dataset was used whereas in the bottom plot outliers were removed).

ggplot(data=rwine,aes(x=sulphates)) +
  geom_histogram(color=I('black'), fill=I('#099DD9')) +
  facet_wrap(~quality.as.factor)

The histograms conditioned on quality do not reveal anything unusual.

ggplot(data=rwine,aes(x=total.sulfur.dioxide, y=sulphates)) +
  geom_point(alpha=1/2)

ggplot(data=rwine,aes(x=free.sulfur.dioxide, y=sulphates)) +
  geom_point(alpha=1/2,position = position_jitter(w=0.4,h=0.01))

Sulphates cannot be equated with sulfur dioxide: both plots show that they are unrelated.

wo_outliers_rwine <- trim_df(rwine,"total.sulfur.dioxide",0.04, 0.96)
ggplot(data=wo_outliers_rwine,aes(x=total.sulfur.dioxide,y=fixed.acidity)) +
  geom_point(position = position_jitter(w=1,h=0.1)) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

cor(x=wo_outliers_rwine$total.sulfur.dioxide, wo_outliers_rwine$fixed.acidity)
## [1] -0.1229143

The fixed acidity should enhance the effectiveness of sulfur dioxide. If so, we would expect less sulfur dioxide as the wines get more acidic. This is what the plot depicts, but the trend is extremely faint (-0.12).

wo_outliers_rwine <- trim_df(rwine,"total.sulfur.dioxide",0.04, 0.96)
ggplot(data=wo_outliers_rwine,aes(x=total.sulfur.dioxide,y=pH)) +
  geom_point(position = position_jitter(w=1,h=0.01)) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

cor(x=wo_outliers_rwine$total.sulfur.dioxide, wo_outliers_rwine$pH)
## [1] -0.004638037

If the pH is low, the amount of sulfur needed for preservation ought to be small. This domain knowledge is not supported by the plot. Indeed, the correlation coefficient is very small (-0.07), in agreement with the top plot. In the bottom plot, outliers were removed, and by doing so, the correlation coefficient gets even closer to zero (-0.005).

ggplot(data=rwine,aes(x=total.sulfur.dioxide,y=quality)) +
  geom_point(position = position_jitter(w=0,h=0.25)) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

ggplot(data=subset(rwine, total.sulfur.dioxide<200),
       aes(x=total.sulfur.dioxide,y=quality)) +
  geom_point(position = position_jitter(w=1,h=0.25)) +
  stat_smooth(method='lm', formula = y ~ x, color='red') 

Total sulfur dioxide correlates moderately with quality (+0.25), but there is no linear relationship between the two variables, as shown in the top plot. This nonlinearity is more apparent in the bottom plot, where the x-axis is restricted to remove outliers.

Two wines with a strikingly high amount of total sulfur dioxide were rated as good. This is surprising since such a high level is likely to be a wine fault. Is there a feature that balances this supposedly detrimental effect of total sulfur dioxide for both outliers?

ggplot(data=rwine,aes(x=quality.as.factor, y=total.sulfur.dioxide)) +
  geom_boxplot()

ggplot(data=rwine,aes(x=quality.as.factor, y=log10(total.sulfur.dioxide))) +
  geom_boxplot()

The median of the distribution of total sulfur dioxide varies almost quadratically with quality (see top plot). This is more evident after a log10 transformation (see bottom plot).

ggplot(data=rwine,aes(x=total.sulfur.dioxide)) +
  geom_histogram(color=I('black'), fill=I('#099DD9')) +
  facet_wrap(~quality.as.factor)

The histograms conditioned on quality do not reveal anything unusual, except perhaps that the distribution of total sulfur dioxide of good wines is bimodal.

ggplot(data=rwine, aes(x=citric.acid,y=volatile.acidity)) +
  geom_point(position = position_jitter(h=0)) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

From the scatter plot matrix, we see that citric acid correlates positively with the fixed acidity (+0.67), which makes sense since it is part of the fixed acidity. It correlates negatively with the volatile acidity (-0.55), a relationship we cannot explain. This correlation may nevertheless explain why citric acid appears to affect wine quality, since volatile acidity does.

ggplot(data=rwine,aes(x=citric.acid,y=quality)) +
  geom_point(position = position_jitter(w=0.01,h=0.25)) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

Citric acid correlates moderately with quality (+0.23). From the scatter plot, we would have guessed a somewhat lower correlation coefficient.

ggplot(data=rwine,aes(x=quality.as.factor, y=citric.acid)) +
  geom_boxplot()

We observe that the median of the distribution for citric acid increases steadily as quality gets better. But citric acid was not supposed to play any role.

The outlier with an amount of citric acid close to 1 was rated as below average.

ggplot(data=rwine,aes(x=citric.acid)) +
  geom_histogram(color=I('black'), fill=I('#099DD9')) +
  facet_wrap(~quality.as.factor)

The histograms conditioned on quality do not reveal anything unusual, except that the distribution of citric acid conditioned on good wines is bimodal.

ggplot(data=rwine,aes(x=volatile.acidity,y=quality)) +
  geom_point(position = position_jitter(w=0.02,h=0.25)) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

wo_outliers_rwine <- trim_df(rwine,"volatile.acidity",0.04, 0.96)
ggplot(data=wo_outliers_rwine,aes(x=volatile.acidity,y=quality)) +
  geom_point(position = position_jitter(w=0.01,h=0.25)) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

cor(x=wo_outliers_rwine$volatile.acidity, wo_outliers_rwine$quality)
## [1] -0.4762093

As shown in the top plot, volatile acidity correlates moderately with quality (-0.39). After removing the samples whose conditional cumulative distribution is below 0.04 or above 0.96, as done in the bottom plot, the correlation coefficient strengthens to -0.48.

ggplot(data=rwine,aes(x=quality.as.factor, y=volatile.acidity)) +
  geom_boxplot()

ggplot(data=wo_outliers_rwine,aes(x=quality.as.factor, y=volatile.acidity)) +
  geom_boxplot()

The median of the distribution for volatile acidity decreases steadily as quality gets better (in the top plot, the full dataset was used whereas in the bottom plot outliers were removed). This is in accordance with our domain knowledge, namely that volatile acidity is a wine fault. From the box plot, we conclude that volatile acidity could help discriminate among the quality ratings, but for good and very good wines the (median of the) volatile acidity is already very low and volatile acidity no longer helps.

ggplot(data=rwine,aes(x=volatile.acidity)) +
  geom_histogram(color=I('black'), fill=I('#099DD9')) +
  facet_wrap(~quality.as.factor)

The histograms conditioned on quality do not reveal anything unusual.

ggplot(data=rwine,aes(x=fixed.acidity,y=quality)) +
  geom_point(position = position_jitter(w=0.02,h=0.25))

The plot answers a question we posed in the previous section: the outliers with a high amount of fixed acidity were rated neither as bad nor as very good.

ggplot(data=rwine,aes(x=chlorides,y=quality)) +
  geom_point(position = position_jitter(w=0.02,h=0.25)) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

wo_outliers_rwine <- trim_df(rwine,"chlorides",0.04, 0.96)
ggplot(data=wo_outliers_rwine,aes(x=chlorides,y=quality)) +
  geom_point(position = position_jitter(w=0.01,h=0.25)) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

cor(x=wo_outliers_rwine$chlorides, wo_outliers_rwine$quality)
## [1] -0.2341884

As shown in the top plot, chlorides correlate faintly with quality (-0.13). After removing the outliers, as depicted in the bottom plot, the correlation is slightly more pronounced (-0.23).

ggplot(data=rwine,aes(x=quality.as.factor, y=chlorides)) +
  geom_boxplot()

ggplot(data=wo_outliers_rwine,aes(x=quality.as.factor, y=chlorides)) +
  geom_boxplot()

However, the median conditioned on quality shows no strong tendency (in the top plot, the full dataset was used whereas in the bottom plot outliers were removed). Some outliers are remarkably salty but were rated as average, which is surprising. So it is likely that salt is not a major feature. But salt might soften strongly acidic wines (see Saltiness in Wine). We should check in the next section whether the outliers are strongly acidic wines.

ggplot(data=rwine,aes(x=chlorides)) +
  geom_histogram(color=I('black'), fill=I('#099DD9')) +
  facet_wrap(~quality.as.factor)

The histograms conditioned on quality do not reveal anything unusual.

ggplot(data=rwine,aes(x=residual.sugar,y=quality)) +
  geom_point(position = position_jitter(w=0,h=0.25)) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

wo_outliers_rwine <- trim_df(rwine,"residual.sugar",0.04, 0.96)
ggplot(data=wo_outliers_rwine,aes(x=residual.sugar,y=quality)) +
  geom_point(position = position_jitter(w=0.05,h=0.25)) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

cor(x=wo_outliers_rwine$residual.sugar, wo_outliers_rwine$quality)
## [1] 0.05778014

Residual sugar and quality do not correlate (+0.01), a result that came as a surprise. This is true when the full dataset is used (top plot). Even when outliers are removed (bottom plot), the correlation coefficient is pretty small (+0.06).

ggplot(data=rwine,aes(x=quality.as.factor, y=residual.sugar)) +
  geom_boxplot()

The medians of the box plots conditioned on quality do not present any pattern.

ggplot(data=rwine,aes(x=residual.sugar)) +
  geom_histogram(color=I('black'), fill=I('#099DD9')) +
  facet_wrap(~quality)

The histograms conditioned on quality do not reveal anything unusual.

ggplot(data=rwine,aes(x=residual.sugar,y=fixed.acidity)) +
  geom_point(position = position_jitter(w=0.1,h=0)) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

ggplot(data=rwine,aes(x=residual.sugar,y=fixed.acidity)) +
  geom_point(position = position_jitter(w=0.1,h=0)) +
  stat_smooth(method='lm', formula = y ~ x, color='red') +
  xlim(1,4)

with( subset(rwine, residual.sugar>=1 & residual.sugar<=4),
      cor(residual.sugar,fixed.acidity))
## [1] 0.2501002

The residual sugar somewhat balances the fixed acidity: the two correlate positively (+0.11), as shown in the top plot, and the correlation coefficient rises to +0.25 if we restrict the residual sugar to sensible limits, as done in the bottom plot.

ggplot(data=rwine,aes(x=residual.sugar,y=alcohol)) +
  geom_point(position = position_jitter(w=0.1,h=0)) +
  stat_smooth(method='lm', formula = y ~ x, color='red') +
  xlim(1,4)

with( subset(rwine, residual.sugar>=1 & residual.sugar<=4),
      cor(residual.sugar,alcohol))
## [1] 0.07211727

Since the correlation coefficient is +0.04, the residual sugar is surprisingly uncoupled from the amount of alcohol. Even if the residual sugar is constrained within sensible limits, the correlation coefficient is still pretty small (+0.07), in agreement with the plot.

ggplot(data=rwine,aes(x=pH,y=citric.acid)) +
  geom_point(position = position_jitter(w=0.01,h=0), alpha=1/2) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

ggplot(data=rwine,aes(x=pH,y=fixed.acidity)) +
  geom_point(position = position_jitter(w=0.0,h=0), alpha=1/2) +
  stat_smooth(method='lm', formula = y ~ x, color='red')

The pH correlates negatively (-0.54) with the citric acid as expected (top plot). The variance is certainly due to other acids. Indeed, the correlation is even better (-0.67) between pH and fixed acidity (bottom plot).

ggplot(data=rwine,aes(x=pH,y=quality)) +
  geom_point(position = position_jitter(w=0.01,h=0.25))

Wines with a pH greater than 3.5 are most exposed to excessive acetic acid production. Nevertheless, the plot does not support the assertion that most bad wines have a very high pH.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The scatter plot matrix revealed most of the relationships between variables. We had to zoom in to check that the correlation coefficients were not misleading, which they would be wherever no linear relationship holds.

Quality, our variable of interest, depends almost linearly on alcohol but the variance in the prediction is huge. This came as a surprise. In fact, all linear models based on a single predictor fail miserably.
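For a linear model with a single predictor, the share of variance explained (R squared) equals the squared correlation coefficient. Squaring the full-dataset coefficients quoted in this section gives a rough idea of how little each predictor explains on its own. The vector r below simply re-enters those quoted values by hand.

```r
# Correlation with quality, as quoted in the text (full dataset).
r <- c(alcohol          =  0.48,
       volatile.acidity = -0.39,
       sulphates        =  0.25,
       citric.acid      =  0.23)

# Percentage of the variance in quality explained by each
# single-predictor linear model.
round(100 * r^2)
```

Even alcohol, the best single predictor, leaves roughly three quarters of the variance in quality unexplained.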

The second most influential variable on quality was the volatile acidity, once outliers of the distribution of volatile acidity conditioned on quality were removed.

It was also striking that sulphates plays a role once the outliers (due to variability) were removed. Sulphates seems to be mistakenly thought of as sulfur dioxide, but in the dataset the sulphates variable is independent of the sulfur dioxide variables.

Finally, the total sulfur dioxide might influence the quality of a wine but in a nonlinear fashion.

Unexpectedly, the residual sugar has almost no influence on the quality of a red wine.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It seems that the residual sugar balances the acidity of a wine, but this balancing effect is not as strong as expected.

The residual sugar is unrelated to the amount of alcohol.

What was the strongest relationship you found?

There are many strong relationships between the variables, but none of them is interesting because they are trivial: the relationship between citric acid and fixed acidity, the relationship between total and free sulfur dioxide, and finally the relationship between density and citric acid (only shown in the scatter plot matrix).

Multivariate Plots Section

wo_outliers_rwine <- trim_df(rwine,"sulphates",0.04, 0.96)
ggplot(data=wo_outliers_rwine,aes(x=alcohol,y=sulphates)) +
  geom_point(position = position_jitter(w=0.03,h=0.25),
             aes(color=quality.as.factor),
             size = 4) +
  scale_color_brewer(type = 'div') +
  geom_abline(intercept = 3.8, slope = -0.29)

We chose sulphates and alcohol because both are strong predictors of quality and they do not correlate with each other (they are orthogonal). However, no structure stands out. We only see that we could draw a straight line to roughly separate the bulk of above average, good and very good wines from the rest. This coarse clustering is not particularly helpful.
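The straight line drawn with geom_abline above amounts to a simple decision rule: a wine lies above the line when sulphates > 3.8 - 0.29 * alcohol. A minimal sketch of that rule follows; the function name above_line is hypothetical, the first example uses the first wine listed in the str() output, and the second wine is made up.

```r
# TRUE when the point (alcohol, sulphates) lies above the separating line
# sulphates = intercept + slope * alcohol.
above_line <- function(alcohol, sulphates, intercept = 3.8, slope = -0.29) {
  sulphates > intercept + slope * alcohol
}

above_line(alcohol = 9.4,  sulphates = 0.56)  # first wine in the data: FALSE
above_line(alcohol = 12.0, sulphates = 0.80)  # made-up wine: TRUE
```

Cross-tabulating this rule against quality.as.factor would put a number on how coarse the separation is.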

Choosing volatile acidity or even citric acid instead of sulphates (other strong predictors) is also a dead end.

wo_outliers_rwine <- trim_df(rwine,"sulphates",0.04, 0.96)
ggplot(data=rwine,aes(x=alcohol,y=quality)) +
  geom_point(position = position_jitter(w=0.03,h=0.25),
             aes(color=sulphates),
             size = 6) +
  scale_color_gradient(low="blue", high = "red", limits=c(0.4, 0.9))

This way of plotting, whatever the variables are, is also a dead end since no structure is apparent.

# this function plots variable named 'v1' against variable named 'v2'
# of the data frame rwine faceted over the categorical
# variable rwine$quality.as.factor;
# the plots also include a linear model;
# 'low_p' and 'high_p' are user-defined bottom and top
# sample quantiles respectively for the variable 'v2' exclusively;
# samples populating the bottom and top quantiles
# as specified by 'low_p' and 'high_p' will not be considered; 
# this function prints also the correlation coeff. between 'v1' and 'v2'
# conditioned on quality.as.factor. 

# notice the use of print( ggplot(...) ) to force plotting inside the function;
# 'v1' and 'v2' are strings and so we look them up with .data[[...]] inside aes
# (aes_string is deprecated in recent ggplot2 releases), otherwise both
# variables (specified through their names) would not be recognized;

multiv_plot <- function(v1,v2,low_p,high_p) {
  wo_outliers_rwine <- trim_df(rwine,v2,low_p, high_p)
  print(
  ggplot(data=wo_outliers_rwine,aes(x=.data[[v1]],y=.data[[v2]])) +
    geom_jitter(size = 2) +
    facet_wrap(~quality.as.factor) +
    stat_smooth(method='lm', formula = y ~ x, color='red') +
    labs(x=v1, y=v2)
  )
  for (q in levels(wo_outliers_rwine$quality.as.factor) ) {
    wines <- subset(wo_outliers_rwine, quality.as.factor==q)
    print(c(q,": cor = ",round(cor(x=wines[[v1]], y=wines[[v2]]), digits=2)))
  }
}
multiv_plot("alcohol","sulphates",0.04,0.96)

## [1] "bad"      ": cor = " "-0.68"   
## [1] "below av." ": cor = "  "0.19"     
## [1] "average"  ": cor = " "0.12"    
## [1] "above av." ": cor = "  "0.04"     
## [1] "good"     ": cor = " "0.06"    
## [1] "very good" ": cor = "  "0.12"

We claimed that sulphates and alcohol were orthogonal to each other. This also holds when we condition on quality, except for bad wines (-0.68). That value must be treated with care since the number of bad wines is very small. Furthermore, there is no smooth transition in correlation from below average wines to bad wines.
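How little a correlation based on so few wines tells us can be made concrete with an approximate 95% confidence interval from the Fisher z-transformation. This is a sketch and not part of the original analysis: the helper fisher_ci is hypothetical, and the sample sizes are assumptions (the quality table lists 10 bad wines, and the trimming removes at least the lowest and highest of them).

```r
# Approximate confidence interval for a Pearson correlation 'r'
# estimated from 'n' samples, via the Fisher z-transformation.
fisher_ci <- function(r, n, level = 0.95) {
  z  <- atanh(r)                     # transform to the z scale
  se <- 1 / sqrt(n - 3)              # approximate standard error of z
  zc <- qnorm(1 - (1 - level) / 2)   # normal critical value
  tanh(c(lower = z - zc * se, upper = z + zc * se))
}

round(fisher_ci(-0.68, n = 10), 2)  # if all 10 bad wines were kept
round(fisher_ci(-0.68, n = 8), 2)   # if two were trimmed away
```

With 8 wines the interval already straddles zero, so the -0.68 for bad wines is compatible with no correlation at all.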

multiv_plot("alcohol","free.sulfur.dioxide",0.04,0.96)

## [1] "bad"      ": cor = " "-0.23"   
## [1] "below av." ": cor = "  "-0.17"    
## [1] "average"  ": cor = " "-0.12"   
## [1] "above av." ": cor = "  "0.06"     
## [1] "good"     ": cor = " "-0.01"   
## [1] "very good" ": cor = "  "0.45"

From the scatter plot matrix, we read off that alcohol and free sulfur dioxide do not correlate. Conditioned on quality, the correlation coefficient increases from -0.23 for bad wines to +0.45 for very good wines, although not strictly monotonically (+0.06 for above average wines, -0.01 for good wines). Nevertheless, we cannot leverage this information since the trends for above average and good wines are nearly identical.
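As a small worked check of how steady this increase is, the printed coefficients can be compared step by step. The sketch below only reuses the six values printed above; the variable names are ours.

```r
# Per-category correlations between alcohol and free sulfur dioxide,
# copied from the printed output above (bad ... very good).
cors <- c(bad = -0.23, `below av.` = -0.17, average = -0.12,
          `above av.` = 0.06, good = -0.01, `very good` = 0.45)

# Change in correlation between consecutive quality categories.
steps <- round(diff(cors), 2)
print(steps)

# TRUE only if the correlation rises at every step.
strictly_increasing <- all(steps > 0)
print(strictly_increasing)   # FALSE: the step from "above av." to "good" is -0.07

# Spearman rank correlation between category order and coefficient;
# a value close to 1 indicates an almost monotone increase.
rank_trend <- cor(seq_along(cors), cors, method = "spearman")
print(round(rank_trend, 2))  # 0.94
```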

multiv_plot("alcohol","total.sulfur.dioxide",0.04,0.96)

## [1] "bad"      ": cor = " "-0.38"   
## [1] "below av." ": cor = "  "-0.23"    
## [1] "average"  ": cor = " "-0.16"   
## [1] "above av." ": cor = "  "-0.07"    
## [1] "good"     ": cor = " "0.08"    
## [1] "very good" ": cor = "  "0.43"
multiv_plot("density","total.sulfur.dioxide",0.04,0.96)

## [1] "bad"      ": cor = " "0.32"    
## [1] "below av." ": cor = "  "0.16"     
## [1] "average"  ": cor = " "0.08"    
## [1] "above av." ": cor = "  "-0.01"    
## [1] "good"     ": cor = " "0.03"    
## [1] "very good" ": cor = "  "-0.58"

The same observations hold for the last two plots: trends are nearly identical for two or three consecutive wine quality categories.

multiv_plot("citric.acid","residual.sugar",0.04,0.96)

## [1] "bad"      ": cor = " "-0.41"   
## [1] "below av." ": cor = "  "0.11"     
## [1] "average"  ": cor = " "0.12"    
## [1] "above av." ": cor = "  "0.18"     
## [1] "good"     ": cor = " "0.18"    
## [1] "very good" ": cor = "  "0.47"
multiv_plot("fixed.acidity","total.sulfur.dioxide",0.04,0.96)

## [1] "bad"      ": cor = " "0.67"    
## [1] "below av." ": cor = "  "0.07"     
## [1] "average"  ": cor = " "-0.09"   
## [1] "above av." ": cor = "  "-0.14"    
## [1] "good"     ": cor = " "-0.08"   
## [1] "very good" ": cor = "  "-0.43"

The information conveyed by the last two plots is dubious since the correlation is almost zero except for bad and very good wines. We prefer plots that exhibit a gradual evolution of the correlation coefficient across all wine categories.

multiv_plot("total.sulfur.dioxide","pH",0.04,0.96)

## [1] "bad"      ": cor = " "-0.82"   
## [1] "below av." ": cor = "  "-0.29"    
## [1] "average"  ": cor = " "-0.15"   
## [1] "above av." ": cor = "  "0.03"     
## [1] "good"     ": cor = " "0.28"    
## [1] "very good" ": cor = "  "0.48"

This is the only plot of this kind that exhibits a smooth change in trends from bad to very good wines. However, we find it hard to exploit this information. A clustering approach might be more beneficial and more straightforward to interpret.
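A clustering approach of the kind alluded to here could be sketched as follows. This is an illustration under assumptions, not something carried out in this analysis: the helper name cluster_vs_quality, the choice of k-means, the number of clusters and the variables passed in are all ours.

```r
# Hypothetical helper (an assumption, not part of the original analysis):
# run k-means on standardized numeric columns and cross-tabulate the
# resulting clusters against the quality categories.
cluster_vs_quality <- function(df, vars, k = 3,
                               group = "quality.as.factor", seed = 42) {
  set.seed(seed)                        # k-means starts from random centers
  x <- scale(df[, vars, drop = FALSE])  # standardize so no variable dominates
  km <- kmeans(x, centers = k, nstart = 25)
  table(cluster = km$cluster, quality = df[[group]])
}

# Possible usage:
# cluster_vs_quality(rwine, c("alcohol", "sulphates", "volatile.acidity",
#                             "total.sulfur.dioxide"))
```

If the clusters were informative about quality, each row of the resulting table would concentrate on a few neighboring quality categories; a table with evenly spread counts would confirm the absence of structure.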

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

We were hoping that a third variable could explain the huge variance we observed when plotting quality versus a strong predictor. But a third variable is of no help in explaining the variance. There are also no striking structures when plotting against a third variable.

Were there any interesting or surprising interactions between features?

When faceting over quality, the correlation coefficient might vary gradually. We do not know how to exploit this information.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No.

Final Plots and Summary

ggplot(data=rwine,aes(x=alcohol,y=quality)) +
  geom_point(position = position_jitter(w=0.03,h=0.25)) +
  stat_smooth(method='lm', formula = y ~ x, color='red') +
  labs(title="Quality vs. alcohol",
       x="alcohol [% by volume]",
       y="quality")

Based on correlation coefficients, alcohol is the strongest predictor of quality. Nevertheless, the correlation coefficient lies slightly below 0.5. As the plot shows, a linear model would predict quality poorly. It is a matter of debate whether a straight line passing through the medians of alcohol conditioned on quality would do a better job.
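The alternative mentioned in the last sentence can be sketched in a few lines. The helper median_line below is hypothetical (it is not part of the original analysis): it computes the median of alcohol for each quality grade and fits a straight line through those points only.

```r
# Hypothetical helper (an assumption, not part of the original analysis):
# median of x for each value of y, then a straight line fitted through
# those medians instead of through all observations.
median_line <- function(df, x = "alcohol", y = "quality") {
  med <- aggregate(df[[x]], by = list(df[[y]]), FUN = median)
  names(med) <- c(y, x)
  fit <- lm(reformulate(x, response = y), data = med)
  list(medians = med, fit = fit)
}

# Possible usage, compared with the fit on all observations:
# coef(median_line(rwine)$fit)
# coef(lm(quality ~ alcohol, data = rwine))
```

Such a line ignores the spread within each grade, so even a good fit through the medians would say little about how well individual wines can be predicted.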

wo_outliers_rwine <- trim_df(rwine,"sulphates",0.04, 0.96)
ggplot(data=wo_outliers_rwine,aes(x=alcohol,y=sulphates)) +
  geom_point(position = position_jitter(w=0.03,h=0.25), aes(color=quality.as.factor), size = 4) +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'quality')) +
  geom_abline(intercept = 3.8, slope = -0.29) +
  labs(title="Sulphates vs. alcohol",
       x="alcohol [% by volume]",
       y=bquote("sulphates [g/" ~dm^3~ "]"))

Adding a third variable reveals no structure on which we could rely for classification.

This observation holds for any other third variable.

It might well be that quality depends nonlinearly upon some variables.
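One inexpensive way to probe this conjecture would be to compare a straight-line fit with a quadratic one for a given variable. The sketch below is an assumption on our part (the helper name nonlinearity_check and the choice of a second-order polynomial are not part of the original analysis), and it treats quality as a numeric response.

```r
# Hypothetical helper (an assumption, not part of the original analysis):
# compare a linear and a quadratic fit of y on x with an F-test.
nonlinearity_check <- function(df, x, y = "quality") {
  d <- data.frame(x = df[[x]], y = df[[y]])
  linear    <- lm(y ~ x, data = d)
  quadratic <- lm(y ~ poly(x, 2), data = d)
  cmp <- anova(linear, quadratic)  # does the quadratic term help?
  c(adj_r2_linear    = summary(linear)$adj.r.squared,
    adj_r2_quadratic = summary(quadratic)$adj.r.squared,
    p_value          = cmp[["Pr(>F)"]][2])
}

# Possible usage:
# nonlinearity_check(rwine, "sulphates")
```

A small p-value together with a clearly higher adjusted R-squared for the quadratic fit would hint at a nonlinear dependence worth exploring with more flexible methods.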

ggplot(data=rwine,aes(x=residual.sugar,y=fixed.acidity)) +
  geom_point(position = position_jitter(w=0.1,h=0)) +
  stat_smooth(method='lm', formula = y ~ x, color='red') +
  stat_ellipse(color="green") +
  xlim(1,4) +
  labs(title="fixed acidity vs. residual sugar",
       x=bquote("residual sugar [g/" ~dm^3~ "]"),
       y=bquote("fixed acidity [g/" ~dm^3~ "]"))

Residual sugar and quality do not correlate, a result that came as a surprise. This could be ascribed to the fact that fixed acidity somewhat balances residual sugar.

Evidently, the quality of the dataset is questionable. According to Cortez et al. [1], the quality variable is the median of the grades given by at least 3 experts. What if the experts held very different opinions about a wine’s quality? We believe that aggregating the grades and discarding the raw data was not appropriate. By doing so, invaluable information about the uncertainty on a wine’s quality was lost. This information could have been leveraged by discarding wines for which there was no broad consensus among the experts.

Furthermore, there is no evidence that the set of variables is sufficient to explain a wine’s quality. We would have welcomed additional variables such as the amounts of tannins and phenolics.

Reflection

We found this dataset much more difficult to analyze than the pseudo-facebook dataset used throughout the course Data Analysis with R. The reasons are twofold: 1/ there were no obvious variables that would in theory and practice strongly correlate with the feature of interest, and 2/ there was no single third variable that would explain most of the variance observed when plotting the feature of interest vs. a strong predictor.

As already mentioned, it was startling that most of the compounds usually associated with the quality of a wine (at least for health reasons), such as tannins, were not included as variables. So we had to build our domain knowledge, otherwise we would have been in the dark. Building this knowledge, even partially, was exciting, enlightening and really helpful for the analysis.

The bivariate analysis was not fruitful until we realized that the correlation coefficients were strongly distorted by the presence of outliers. Incidentally, outliers were not predominantly found in wines with bad or very good ratings as we initially supposed.
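The sensitivity of the correlation coefficient to outliers can be checked directly. The sketch below is an illustration under assumptions: the helper cor_outlier_check is hypothetical, and its quantile-based trimming only mimics what we understand trim_df to do for the second variable.

```r
# Hypothetical helper (an assumption, not part of the original analysis):
# Pearson correlation on all samples, Pearson correlation after dropping
# samples of v2 outside the [low_p, high_p] quantile range, and the
# rank-based Spearman correlation, which is less sensitive to outliers.
cor_outlier_check <- function(df, v1, v2, low_p = 0.04, high_p = 0.96) {
  x <- df[[v1]]
  y <- df[[v2]]
  lim <- quantile(y, probs = c(low_p, high_p), na.rm = TRUE)
  keep <- y >= lim[1] & y <= lim[2]
  c(pearson_all     = cor(x, y),
    pearson_trimmed = cor(x[keep], y[keep]),
    spearman_all    = cor(x, y, method = "spearman"))
}

# Possible usage:
# cor_outlier_check(rwine, "alcohol", "sulphates")
```

A large gap between the first two values would signal that a few extreme samples drive the untrimmed coefficient.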

We were hoping that the multivariate analysis would explain much of the variance we observed in bivariate plots. We were disappointed. Adding a third variable revealed no pattern or structure that could be useful in classification.

We believe that this dataset should be addressed using another angle of attack: since quality may depend nonlinearly upon some variables, nonlinear techniques might reveal structures that could be exploited.

This dataset (and potentially an extended version of it) is interesting: the impact of a model based on its analysis would be considerable. Indeed, suppose that it is possible to accurately predict a wine’s quality based on some chemicals and / or properties. It would then be possible to obtain much larger gains in wine quality when artificially modifying the composition of a wine.

Acknowledgments

This is the third draft of my work and I would like to thank the two anonymous reviewers for their suggestions and comments.

References

[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.